We analyze correlations between the first letter of the name of an author and the number of
citations their papers receive. We look at simple mean counts, numbers of highly-cited papers, and 
normalized h-indices, by letter. To our surprise, we conclude that orthographically senior authors 
produce a better body of work than their colleagues, despite some evidence of discrimination against 
them. 



I. INTRODUCTION 

Citation counts and indices based on them are a key 
aspect of the sociology and practice of astrophysics, and 
of many other fields. Various metrics to associate the ci- 
tation counts of an author with their value as a scientist 
have been developed, such as the h-index, h-b-index, and 
g-index [1], though all have been criticized as inappropriate
in different contexts. Indeed, any metric based solely
on citation counts is ultimately flawed, since it is unable 
to distinguish between citations of the form "Work in [2]
forms the basis of this discussion" and "We find the
results in [3] to be utterly incorrect". Some unscrupulous
authors even inappropriately cite their own work to
improve their indices.

Despite their flaws, citation counts and indices cer- 
tainly do have an important impact, both on science and 
scientists. Highly cited papers become part of the canon 
of standard works, and the authors of highly-cited papers 
are employed and feted 1 . Characterizations of patterns 
of citation can therefore be both metrics for wider socio- 
scientific trends and indicators of surprising aspects of 
paper generation, publication and promulgation systems. 

There have been previous studies of the correlation between
number of citations and the length [4], number of
authors [5], release timing field Hi and publication
method [9] of articles. Here we study another aspect of
citation patterns: correlation between the number of ci- 
tations received by an article and its orthography. Specif- 
ically, we correlate the number of citations a paper has 
with the first letter of its first author's surname. Selection
by paper rather than by author does pose some
problems, but it is significantly more tractable to collect
such data given inconsistency in citation names.



1 An earlier version of this manuscript misspelled this word as
fetid; we apologize for any confusion caused.

In Section II we discuss the hypotheses we wish to test
and the data used. In Section III we analyze our data
and discuss various citation measures. We conclude in
Section IV.



II. HYPOTHESES & DATA 

We can immediately lay out a near-exhaustive hypoth- 
esis space of this problem: 

H0 Authors with names near the end of the alphabet
(AWNNTEOTAs) have more citations, because of 
some intrinsic superiority 2 . 

H1 AWNNTEOTAs have fewer citations, because they
are discriminated against. 

We have omitted a third possibility, that there is no sig- 
nificant difference in the citation counts, owing to its ev- 
ident implausibility. 

To test these hypotheses we use data from the 
SAO/NASA Astrophysics Data System (ADS). We use 
data from 12 randomly selected months, extracted using 
that site's query tool: we list all the astrophysics pa- 
pers published in that month. We extract the number 
of citations to each article whenever that information is 
available. The months selected are listed in Table I.



2 We use the word superiority here somewhat tautologously, sim- 
ply to indicate the number of citations received. We do not 
intend to imply any superiority in ability, intellect or physical 
appearance. 
TABLE I: Months analyzed.

Month      Year    Month      Year
June       2000    May        2006
November   2000    August     2006
December   2001    February   2007
April      2002    May        2007
August     2002    November   2007
January    2005    February   2009

We reduce the data such that for each letter of the al- 
phabet we have a list of all the citation counts of papers 
written by authors whose name starts with that letter. 
For some months the more exclusive letters have no first- 
author publications. The months cover a long enough 
period that both new papers and more highly cited pa- 
pers are included. 
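
The reduction described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the record layout (surname, citation count), the function name group_by_initial and the sample names are all assumptions, and for simplicity only the unaccented letters A to Z are kept.

```python
from collections import defaultdict


def group_by_initial(papers):
    """Map each initial letter to the citation counts of its papers.

    `papers` is an iterable of (first_author_surname, citation_count)
    pairs. This layout is assumed for illustration; it is not the
    format returned by the ADS query tool.
    """
    by_letter = defaultdict(list)
    for surname, citations in papers:
        name = surname.strip()
        if not name or citations is None:
            continue  # no usable name, or citation count unavailable
        initial = name[0].upper()
        if "A" <= initial <= "Z":  # simplification: skip other initials
            by_letter[initial].append(citations)
    return dict(by_letter)


# Invented records, purely illustrative.
sample = [("Zimmer", 12), ("Adams", 3), ("zhou", 7),
          ("Brown", None), ("Allen", 0)]
groups = group_by_initial(sample)
```

Records without a citation count are dropped, mirroring the statement that counts are extracted only when that information is available.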







FIG. 1: Mean citation counts with alphabetic position and 
time. Horizontal location indicates alphabetic position and 
the color indicates time. The stars are individual data points 
and the solid lines are linear trends for each month. 



III. ANALYSIS 

In this section we consider how various citation mea- 
sures correlate with alphabetic position. We need to be 
careful here to tease out effects that arise from correlation 
between position and popularity of a letter; rigorous sta- 
tistical techniques are imperative. To efficiently and un- 
derstandably analyze our data we adopt the usual astro- 
physical paradigm: Bayesio-frequentist statistics, where 
frequentist methods are used and the results interpreted 
as though they were Bayesian probabilities. We also fol- 
low usual practice and use a number of different estima- 
tors, continuing until we find one that can demonstrate 
the correct hypothesis to be true. 



A. Mean citation counts 

The simplest metric we can use is just the mean citation
count for a letter. These values are plotted in Figure 1
and the trend-line coefficients shown in Table II. The
general trend of decreasing citations with time simply
indicates the lack of time for recent papers to accumulate
citations. Disturbingly, these data seem to prefer
hypothesis H1, since they show a decrease in mean citation
counts with alphabetic position, nearly consistently
across the months surveyed.

As noted in Table II, only one month shows a positive
trend with alphabetic seniority. We show statistics
from this month, January 2005, in more detail in Figure 2.
Inspection of the plot suggests that the different
statistical behavior that month was largely due to an
impressive performance by the letter "S" and below-par
performances by "A" and "B".
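
A trend line of the form c = ax + b for a single month could be obtained along the following lines. This is a sketch under stated assumptions: the text does not say how the fit was performed, so an unweighted ordinary least-squares fit is assumed here, and the function names and sample counts are hypothetical.

```python
def mean_by_letter(groups):
    """Return (alphabetic position, mean citations) pairs, with A=1 ... Z=26."""
    return [(ord(letter) - ord("A") + 1, sum(counts) / len(counts))
            for letter, counts in sorted(groups.items()) if counts]


def linear_fit(points):
    """Unweighted least-squares fit of c = a*x + b to (x, c) points.

    Assumes at least two distinct x values; the weighting actually
    used for the published coefficients is not stated in the text.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_c = sum(c for _, c in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxc = sum((x - mean_x) * (c - mean_c) for x, c in points)
    a = sxc / sxx
    b = mean_c - a * mean_x
    return a, b


# Invented citation counts for one month, keyed by initial letter.
groups = {"A": [10, 20], "B": [12], "Z": [2, 4]}
points = mean_by_letter(groups)
a, b = linear_fit(points)
```

A negative slope a corresponds to the behavior favoring H1, and a positive slope to the behavior seen only in January 2005.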



TABLE II: Linear fit coefficients for c = ax + b where c is the
mean number of citations for a letter with alphabetic position
x. Only one month, January 2005, shows the expected trend
of more citations for orthographically advanced authors.

Month    a        b
06/00    -0.033   21.322
11/00    -0.738   30.523
12/01    -0.317   22.096
04/02    -0.229   21.395
08/02    -0.199   22.093
01/05     0.181   13.925
05/06    -0.169   14.975
08/06    -0.270    9.127
02/07    -0.159   11.818
05/07    -0.173   11.993
11/07    -0.130   11.008
02/09    -0.203    8.442



FIG. 2: Citation statistics for January 2005. The citation
count axis is scaled and binned to aid comprehension; the axis
progresses linearly between each tick mark. Color indicates
the square-root of the number of papers in the bin for this
month.



B. Highly Cited Papers

Unlike real doctors, scientists are forgiven their worst
work and judged on their best. The number of uncited
papers 3 produced by an author is irrelevant if they have
any highly cited ones. We therefore consider the most
highly cited papers over all the data. We choose the top
5% of papers as our benchmark for "highly cited", since
we need a large enough number to get reasonable statistics.
This typically means dozens of citations (for the
very recent months) to hundreds or thousands of citations
(for the oldest months in our set). We must also
normalize by the total number of authors with a given
initial letter to account for the rarity of certain letters.

The results are shown in Figure 3 and they show a
clear trend downwards with the progression of the alphabet,
once again supporting hypothesis H1. We therefore
try another analysis method.



3 Paradox prevents us citing such a paper here.



FIG. 3: The number of highly cited papers produced with
initial author letter, together with a trend line. Point errors
are computed using a procedure adapted from supernova cosmology,
where the errors are altered until the reduced χ² ~ 1.



C. Normalized h-indices

The standard h-index cannot be applied to the collected
data, since it does not take into account the number
of authors with a given initial letter. We therefore
normalize it by the number of authors in the full data.
The result for the collated complete data set, with all the
months collected together, is shown in Figure 4.

As shown by the trend line, the result clearly supports
hypothesis H0; AWNNTEOTAs have a higher normalized
h-index. Furthermore, the same effect appears
whether we look at the bulk of the data points or consider
only the extremal values in the former or latter halves of
the data.



FIG. 4: The normalized h-index for our global data set
with alphabetic position (blue points) and a trend line (black
solid). Point errors are produced as for Figure 3.
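
The two normalized measures of Sections III B and III C can be illustrated with the sketch below. Several choices here are assumptions rather than the procedure actually followed: a single global top-5% threshold is used instead of one per month, the number of papers for a letter stands in for the number of authors with that initial, and the function names and sample counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


def highly_cited_fraction(groups, top_fraction=0.05):
    """Per letter: share of its papers at or above the top-5% threshold.

    Assumption: one threshold over all papers pooled together, and
    normalization by the letter's paper count as a proxy for its
    number of authors.
    """
    pooled = sorted((c for counts in groups.values() for c in counts),
                    reverse=True)
    n_top = max(1, int(len(pooled) * top_fraction))
    threshold = pooled[n_top - 1]
    return {letter: sum(c >= threshold for c in counts) / len(counts)
            for letter, counts in groups.items() if counts}


def normalized_h_index(groups):
    """Per letter: h-index divided by the letter's paper count (a proxy)."""
    return {letter: h_index(counts) / len(counts)
            for letter, counts in groups.items() if counts}


# Invented citation counts keyed by initial letter.
groups = {"A": [50, 3, 1, 0], "Z": [9, 8, 7]}
top_share = highly_cited_fraction(groups)
norm_h = normalized_h_index(groups)
```

In this toy sample the letter "A" owns the single most cited paper yet has the lower normalized h-index, which shows how the two measures can point in opposite directions.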



IV. DISCUSSION 

Based on the results generated, our third method of
h-indices appears to be the most reliable; we are therefore
reluctantly directed towards hypothesis H0, that orthographically
high-ranking authors produce a better global
body of work than their less alphabetically gifted colleagues.

Two caveats apply to our selection of data. First, we
are somewhat restrictive by limiting ourselves to first authors
only, particularly in light of the practice of alphabetizing
authors on some papers, though this is more
common for authors after the first. It is possible that
any discrimination against or superiority of AWNNTEOTAs
occurs at the point of choosing which author should
be first. Second, we use the data themselves to estimate
the raw number of authors with a given initial letter.
This might skew results, if, for example, there are large
numbers of authors with the surname "Aaaronson" who
produced no papers at all during the referenced period;
this would make our conclusions stronger.

While future work on this topic could possibly im- 
prove on the analysis and interpretation methods we 
have used here, our conclusions can be made significantly
more robust by improving the data directly.